Traffic and load aware dynamic queue management

ABSTRACT

Some embodiments provide a queue management system that efficiently and dynamically manages multiple queues that process traffic to and from multiple virtual machines (VMs) executing on a host. This system manages the queues by (1) breaking up the queues into different priority pools, with the higher priority pools reserved for particular types of traffic or VMs (e.g., traffic for VMs that need low latency), (2) dynamically adjusting the number of queues in each pool (i.e., dynamically adjusting the size of the pools), and (3) dynamically reassigning a VM to a new queue based on one or more optimization criteria (e.g., criteria relating to the underutilization or overutilization of the queue).

BACKGROUND

In the last few years, queue management systems have been proposed for distributing incoming and outgoing traffic to and from a host through a network interface card (NIC) with multiple queues. FIG. 1 illustrates one such system. Specifically, it illustrates (1) multiple virtual machines (VMs) 102 that execute on a host computer (not shown), and (2) a NIC 100 that has multiple queues. As shown in this figure, each queue has a receive side set 104 of buffers and a transmit side set 106 of buffers to handle respectively incoming and outgoing traffic. The system has four types of queues, which are: a default queue 105, several non-default queues 115, LRO (large receive offload) queues 120 and RSS (receive side scaling) queues 125. The latter two types of queues are specialty queues tied to specific hardware LRO and RSS functionalities supported by the NIC.

The queue management system of FIG. 1 distributes traffic to and from the virtual machines (VMs) across multiple queues. In this system, all VMs start out in a default queue 105. A VM is moved out of the default queue to a non-default queue 115 whenever its traffic exceeds a given threshold. When moving a VM out of the default queue, this implementation always moves the VM to the least-loaded non-default queue, regardless of the requirements of the VM. This causes three major problems.

First, since the current implementation chooses a non-default queue without considering the VM's traffic type, VMs with special requirements might suffer interference from other VMs. For example, if a special VM that transmits and receives latency-sensitive traffic shares the same queue with several other VMs running less latency-sensitive, throughput-intensive workloads, the latency and jitter of the special VM will certainly be affected. Queue 150 in FIG. 1 is an example of an overloaded queue that has traffic for both a low latency required (LLR) VM 152 and several high latency tolerating (HLT) VMs. In this situation, the LLR VM 152 might not be able to send and receive traffic within the maximum latency that it can tolerate because of the traffic of the various HLT VMs.

The second problem with this implementation is that it statically assigns a fixed number of queues to each of the three different non-default pools of queues, which are the non-default queues 115, the LRO (large receive offload) queues 120 and the RSS (receive side scaling) queues 125. In this approach, each pool has all of its queues assigned and allocated during driver initialization. By default, each pool gets the same number of queues, even if the pool is in fact not in use. This creates a performance problem when a pool needs more queues to sustain its traffic, as the overloaded pool can never take over free queues from other pools and thus can never grow, even if the system has the capacity.

The third problem is that queue assignment for a VM is one-time, i.e., once the VM moves to a queue, it is never moved to another non-default queue. This causes two issues. First, because the assignment is one-time, a VM that later needs more resources to grow its traffic might end up being limited by the utilization of its current queue. Even if there is a less-busy queue with more room to grow, this prior approach does not allow the VM to take advantage of it. Second, this approach tries to statically keep all queues busy, even when far fewer queues would suffice to serve the traffic. Since this approach has a dedicated kernel context for each queue, having an unnecessary number of active queues results in more active contexts. These active contexts inevitably halt other contexts (such as a vCPU) when an interrupt arrives. The host therefore ends up spending more cycles on context switches, which hurts the VM consolidation ratio.

BRIEF SUMMARY

Some embodiments provide a queue management system that efficiently and dynamically manages multiple queues that process traffic to and from multiple virtual machines (VMs) executing on a host. This system manages the queues by (1) breaking up the queues into different priority pools, with the higher priority pools reserved for particular types of traffic or VMs (e.g., traffic for VMs that need low latency), (2) dynamically adjusting the number of queues in each pool (i.e., dynamically adjusting the size of the pools), and (3) dynamically reassigning a VM to a new queue based on one or more optimization criteria (e.g., criteria relating to the underutilization or overutilization of the queue).

In some embodiments, the queue management system initially has a newly initialized VM in an unassigned, default pool. When a VM's traffic exceeds a pre-set threshold, the system determines if there is a pool matching the VM's traffic requirement, and if so, assigns the VM to that pool. If there is no matching pool, the system creates a new pool and assigns the VM to that pool. In situations where there are no free queues for creating new pools, the queue management system preempts one or more assigned queues (i.e., queues assigned to previously created pools) and assigns the preempted queue(s) to the newly created pool. This preemption rebalances queues amongst existing pools to free up one or more queues for the new pool. In some embodiments, the rebalancing process across pools can be controlled by resource allocation criteria such as the minimum and maximum size of a pool, the relative priorities of the pools, etc.

Also, the queue management system can rebalance traffic within a pool based on one or more criteria, such as the CPU load of the associated management thread (e.g., kernel context), traffic type, traffic load, other real-time load metrics of the queues, etc. In some embodiments, the system uses different rebalancing criteria for different pools. For instance, the system might want to pack VMs onto a few queues in some pools, while for other pools, it might want to distribute VMs across the queues as much as possible. In some embodiments, the queue management system has a load balancer that performs the rebalancing process periodically and/or on special events.

When a VM's traffic falls below a threshold, the queue management system of some embodiments moves the VM back to a default queue. When the VM is the last VM using a queue in a non-default pool, the system moves the freed queue to the pool of free queues, so that the queue can later be reallocated to any pool.

In addition to the VM data traffic, or instead of it, the queue management system of some embodiments dynamically defines pools, uniquely manages each pool, dynamically modifies the queues within the pools, and dynamically re-assigns data traffic to and from non-VM addressable nodes (e.g., source end nodes or destination end nodes) that execute on a host. Specifically, the system of some embodiments monitors data traffic for a set of VM and/or non-VM addressable nodes (e.g., data end nodes) through the NIC of a host device. Based on this monitoring, the system specifies a pool for at least a set of the addressable nodes, and assigns a set of the queues to the pool. The system then uses destination or source media access control (MAC) address filtering, or five-tuple filtering, to direct to the assigned set of queues the data traffic that is received by, or transmitted from, the host device for the set of non-VM addressable nodes.

Alternatively, or conjunctively, based on the monitoring, the system of some embodiments can modify the set of queues assigned to a pool for the set of VM and non-VM addressable nodes. As mentioned above, examples of such modifications include adding or removing a queue from the pool when one or more of the queues of the pool are overutilized or underutilized. In some embodiments, the system adds a queue to the pool by preempting a queue from another pool, e.g., by using one of the above-described preemption methodologies.

Also, alternatively or conjunctively to the above-described operations, the system can re-assign the data traffic for a VM or a non-VM addressable node (e.g., data end node) from a first queue in the pool to a second queue in the pool, based on the monitoring. For instance, based on the monitoring, the system of some embodiments detects that the traffic for the VM or non-VM addressable node through the first queue falls below a minimum threshold amount of traffic (e.g., for a duration of time). Because of this underutilization, the system switches this traffic to the second queue. Before making this switch, the system of some embodiments determines that the traffic through the second queue does not exceed a maximum threshold amount of traffic.

Based on the monitoring, the system of some embodiments detects that the traffic through the first queue exceeds a maximum threshold amount of traffic (e.g., for a duration of time). Because of this overutilization, the system switches the traffic for a VM or a non-VM addressable node (e.g., data end node) from the first queue to the second queue. Again, before making this switch, the system of some embodiments determines that the traffic through the second queue does not exceed a maximum threshold amount of traffic.

The preceding Summary is intended to serve as a brief introduction to some embodiments of the invention. It is not meant to be an introduction or overview of all inventive subject matter disclosed in this document. The Detailed Description that follows and the Drawings that are referred to in the Detailed Description will further describe the embodiments described in the Summary as well as other embodiments. Accordingly, to understand all the embodiments described by this document, a full review of the Summary, Detailed Description and the Drawings is needed. Moreover, the claimed subject matter is not to be limited by the illustrative details in the Summary, Detailed Description and the Drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

The novel features of the invention are set forth in the appended claims. However, for purposes of explanation, several embodiments of the invention are set forth in the following figures.

FIG. 1 is an example of an overloaded queue that has traffic for both a low latency required (LLR) VM and several high latency tolerating (HLT) VMs.

FIG. 2 illustrates an example of the conceptual grouping of the queues into two priority pools based on the type of traffic or VM requirements.

FIG. 3 illustrates an example of three priority pools for three different types of VMs.

FIG. 4 illustrates threads in the network virtualization layer (NVL) that are responsible for the queues in the PNIC, along with the interrupt generation architecture of some embodiments of the invention.

FIG. 5 illustrates the queue management system of some embodiments.

FIG. 6 conceptually illustrates an overall process that a load balancer performs in some embodiments.

FIG. 7 illustrates a queue assignment process of some embodiments.

FIG. 8 conceptually illustrates a pool adjustment process that a load balancer invokes periodically (e.g., every few seconds) in some embodiments.

FIG. 9 illustrates a process that is performed by the load balancer of some embodiments to assess a VM's utilization of its queue.

FIG. 10 illustrates an example of re-assigning a VM from a first queue to a second queue because of the underutilization of the first queue or because of the VM's underutilization of the first queue.

FIG. 11 illustrates an example of re-assigning a VM from a first queue to a second queue because of the overutilization of the first queue.

FIG. 12 illustrates an example of the pool balancing across the pools.

FIG. 13 illustrates a queue management system of some embodiments that uses MAC address filtering to route data traffic of VMs and non-VM data addressable nodes executing on a host device to different pools of queues, and different queues within the pools.

FIGS. 14 and 15 illustrate examples in which some embodiments use five-tuple filters to differentiate VOIP and video packets that are transmitted or received by a virtual machine during a video presentation.

FIG. 16 conceptually illustrates an electronic system with which some embodiments of the invention are implemented.

DETAILED DESCRIPTION

In the following detailed description of the invention, numerous details, examples, and embodiments of the invention are set forth and described. However, it will be clear and apparent to one skilled in the art that the invention is not limited to the embodiments set forth and that the invention may be practiced without some of the specific details and examples discussed.

Some embodiments provide a queue management system that efficiently and dynamically manages multiple queues that process traffic to and from multiple virtual machines (VMs) executing on a host. This system manages the queues by (1) breaking up the queues into different priority pools, with the higher priority pools reserved for particular types of traffic or VMs (e.g., traffic for VMs that need low latency), (2) dynamically adjusting the number of queues in each pool (i.e., dynamically adjusting the size of the pools), and (3) dynamically reassigning a VM to a new queue based on one or more optimization criteria (e.g., criteria relating to the underutilization or overutilization of the queue).

In some embodiments, the queue management system groups the queues into four types of pools. These are:

(1) a default pool that in some embodiments includes one default queue that is the initial queue for some or all of the VMs upon their initialization (in other embodiments, the default pool includes more than one default queue);
(2) a free pool that includes all of the unused queues (i.e., the queues that are not assigned to traffic to or from any VM);
(3) a hardware-feature pool that includes queues associated with a particular hardware feature, such as LRO and RSS; and
(4) VM-requirement pools that include queues that serve VMs with different kinds of requirements, such as low latency required (LLR) VMs and high latency tolerated (HLT) VMs.
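
This grouping can be captured in a small data sketch. The following Python fragment is only an illustration of the four pool types, not code from any described embodiment; the PoolType names and the Pool record are assumptions made for clarity.

```python
# Illustrative sketch only; names are assumptions, not the patent's code.
from dataclasses import dataclass, field
from enum import Enum, auto

class PoolType(Enum):
    DEFAULT = auto()           # initial queue(s) for newly initialized VMs
    FREE = auto()              # unused queues, available for allocation
    HARDWARE_FEATURE = auto()  # queues tied to a NIC feature (e.g., LRO, RSS)
    VM_REQUIREMENT = auto()    # queues serving a VM requirement (e.g., LLR, HLT)

@dataclass
class Pool:
    name: str
    pool_type: PoolType
    queue_ids: list = field(default_factory=list)

# Initially, one default queue and the rest unassigned (free).
pools = [Pool("default", PoolType.DEFAULT, [0]),
         Pool("free", PoolType.FREE, list(range(1, 8)))]
```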

In some of these embodiments, the queue management system initially has all the queues in an unassigned, free pool, except for one default queue that is in the default pool. Some embodiments do not allocate the default queue until the first VM is initialized, while other embodiments specify the default queue even before the first VM is initialized.

When a VM's traffic exceeds a pre-set threshold, the system determines if there is a pool matching the VM's traffic requirement (e.g., if there is an LLR pool for an LLR VM that is exceeding its threshold), and if so, the system assigns the VM to that pool. If there is no matching pool, the system creates a new pool and assigns the VM to that pool. When there are no free queues for creating new pools, the queue management system preempts one or more assigned queues (i.e., queues assigned to previously specified pools) and assigns the preempted queue(s) to the newly created pool. This preemption rebalances queues amongst existing pools to free up one or more queues for the new pool. In some embodiments, the rebalancing process across pools is based on one or more resource allocation criteria, such as the minimum and maximum size of a pool, the relative priorities of the pools, etc.
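
The assignment flow just described can be sketched as follows. This is a minimal, hedged illustration, not the actual implementation: the Queue and Pool records are redefined here for this flow, and the capacity test and the lowest-priority preemption rule are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Queue:
    qid: int
    load: float = 0.0      # fraction of capacity in use (0.0 to 1.0)

@dataclass
class Pool:
    requirement: str       # e.g., "LLR" or "HLT"
    priority: int = 0      # higher values may preempt lower ones
    queues: list = field(default_factory=list)

def assign_vm(vm_requirement, pools, free_queues, capacity=1.0):
    # Find a pool matching the VM's traffic requirement, or create one.
    pool = next((p for p in pools if p.requirement == vm_requirement), None)
    if pool is None:
        pool = Pool(vm_requirement)
        pools.append(pool)
    # Use a queue in the pool that still has capacity, if any.
    queue = next((q for q in pool.queues if q.load < capacity), None)
    if queue is None:
        if free_queues:                 # grow the pool from the free pool
            queue = free_queues.pop()
        else:                           # preempt from a lower-priority pool
            donor = min((p for p in pools if p is not pool and p.queues),
                        key=lambda p: p.priority, default=None)
            if donor is None:
                raise RuntimeError("no queue available to preempt")
            queue = donor.queues.pop()  # its VMs would first be moved away
        pool.queues.append(queue)
    return queue  # filters would now tie the VM's traffic to this queue
```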

In addition to balancing queues across pools, the queue management system of some embodiments rebalances traffic within a pool. This system uses different criteria in different embodiments to rebalance traffic within a pool. Examples of such criteria include the CPU load of the associated management thread, traffic type, traffic load, other real-time load metrics of the queues, etc. In some embodiments, the system uses different rebalancing criteria for different pools. For instance, the system tries to pack VMs onto fewer queues in some pools, while for other pools, it tries to distribute VMs across the queues as much as possible. In some embodiments, the queue management system has a load balancer that performs the rebalancing process periodically and/or on special events.
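
The pack-versus-spread distinction can be illustrated with a small placement helper. This sketch reuses the Queue record from the previous fragment; the scoring rule is an assumption chosen only to show the two opposing policies.

```python
def pick_queue(pool_queues, policy, capacity=1.0):
    """Pick a queue for a VM under a per-pool placement policy (sketch)."""
    candidates = [q for q in pool_queues if q.load < capacity]
    if not candidates:
        return None
    if policy == "pack":     # e.g., an HLT pool: fill busy queues first
        return max(candidates, key=lambda q: q.load)
    if policy == "spread":   # e.g., an LLR pool: use the emptiest queue
        return min(candidates, key=lambda q: q.load)
    raise ValueError(f"unknown policy: {policy}")
```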

When a VM's traffic falls below a threshold, the queue management system of some embodiments moves the VM back to a default queue. When the VM is the last VM using a queue in a non-default pool, the freed queue is moved to the free pool of unassigned queues, so that it can later be reallocated to any pool. Thus, under this approach, a queue is assigned to one of the non-default pools as soon as the queue gets assigned a VM, and it is assigned back to the free pool as soon as its last VM is reassigned or is shut off.
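
A hedged sketch of this fall-back rule follows. It assumes the records of the earlier fragments plus a vms list on each queue; the threshold value and helper shape are illustrative assumptions, not the patent's logic verbatim.

```python
def maybe_move_to_default(vm, vm_queue, vm_pool, default_queue, free_queues,
                          low_threshold=0.01):
    """Move a VM back to the default queue when its traffic is negligible."""
    if vm["traffic_rate"] >= low_threshold:
        return vm_queue                    # still busy enough; leave it
    vm_queue.vms.remove(vm)                # detach from the non-default queue
    default_queue.vms.append(vm)
    if not vm_queue.vms:                   # last VM left: free the queue
        vm_pool.queues.remove(vm_queue)
        free_queues.append(vm_queue)       # reallocatable to any pool later
    return default_queue
```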

In addition to the VM data traffic, or instead of it, the queue management system of some embodiments dynamically defines pools, uniquely manages each pool, dynamically modifies the queues within the pools, and dynamically re-assigns data traffic to and from non-VM addressable nodes (e.g., source end nodes or destination end nodes) that execute on a host. Performing these operations for VMs is first described below. This discussion is then followed by a discussion of performing these operations for non-VM addressable nodes.

I. Different Pools for VMs with Different Requirements

As mentioned above, the queue management system of some embodiments breaks up the queues into different priority pools, with the higher priority pools reserved for particular types of traffic or VMs (e.g., traffic for VMs that need low latency). FIG. 2 illustrates an example of the conceptual grouping of the queues into two priority pools 210 and 215 based on the type of traffic or VM requirements. Specifically, it illustrates (1) a physical NIC 200 with multiple queues, and (2) multiple virtual machines (VMs) 202 with different requirements. The VMs execute on a host computer (not shown). The traffic to and from these VMs is distributed across the various queues of the NIC. As shown in this figure, each queue has a receive-side set 204 of buffers and a transmit-side set 206 of buffers to handle respectively incoming and outgoing traffic. In some embodiments, one core of a multi-core processor manages each queue. Accordingly, in the example illustrated in FIG. 2, eight cores would manage the eight queues of the NIC.

In some NICs, each receive-side set of buffers 204 is its own standalone queue in the NIC. Likewise, in these NICs, each transmit-side set of buffers 206 is its own standalone queue. However, even though the receive side queues are separate and independent from the transmit side queues in these NICs, the queue management system of some embodiments pairs one receive side queue with one transmit side queue so that the queue pair can be used as one queue construct for a VM. Other embodiments, however, do not “pair” the queues. Specifically, these other embodiments do not require all the VMs that use a receive side queue to use the same transmit side queue; two VMs can use the same receive side queue but different transmit side queues. However, in order to keep the illustrations simple, each queue that is shown in FIGS. 3-6 and 10-15 is a queue pair that includes a receive side queue paired with a transmit side queue.
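
The queue-pair construct can be sketched as below: the RX and TX hardware queues stay independent, but the layer exposes them as one logical queue. The pairing-by-index rule here is an assumption made for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class QueuePair:
    rx_queue_id: int   # receive side hardware queue
    tx_queue_id: int   # transmit side hardware queue

def pair_queues(rx_ids, tx_ids):
    # Pair the i-th RX queue with the i-th TX queue. Embodiments that do
    # not "pair" would instead track RX and TX assignments independently.
    return [QueuePair(rx, tx) for rx, tx in zip(rx_ids, tx_ids)]

pairs = pair_queues(range(8), range(8))  # eight logical queue constructs
```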

In FIG. 2, the queue management system has grouped the queues of a physical NIC 200 into three different types of pools. These are: (1) a default pool 205, (2) an LLR pool 210, and (3) an HLT pool 215. As shown in FIG. 2, the default pool 205 in some embodiments includes only one default queue. In other embodiments, it includes more than one default queue. A default queue serves the traffic of the VMs that are not assigned to a queue of a non-default pool. When there is only one default queue (such as default queue 207), the “default” queue serves all the VMs that are not assigned to a non-default pool. In some embodiments, the default queue 207 serves all the low-traffic VMs, i.e., serves any VM that has less traffic than a pre-set traffic threshold. These VMs have too little traffic to merit being placed in a non-default queue.

In some ways, the default queue 207 can be viewed as not belonging to any pool, since from the hardware's point of view, this queue simply serves all the VMs that do not have a matching filter directing their incoming and outgoing traffic to another queue. The queue management system of some embodiments starts each VM on the default queue until the VM's traffic exceeds the threshold. Once the VM's traffic exceeds the threshold, the system selects a non-default queue for the VM, and then directs the PNIC to allocate a filter for the VM's inbound traffic, and the virtualization layer to allocate a filter for the VM's outbound traffic. In some embodiments, the filter on the outbound traffic is based on the source MAC address, while the filter on the inbound traffic is based on the destination MAC address. These filters direct modules in the PNIC and the virtualization layer to route incoming and outgoing traffic to the selected queue. It should be noted that the filters can be based on other identifiers. For instance, the filter on the outbound traffic in some embodiments is based on the software forwarding element port ID.
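
The two filters can be pictured with a small table sketch; the dictionary layout below is an assumption for illustration, not a real PNIC driver interface.

```python
# Illustrative filter tables; not an actual PNIC or virtualization-layer API.
outbound_filters = {}  # source MAC -> queue id (kept by the virtualization layer)
inbound_filters = {}   # destination MAC -> queue id (programmed into the PNIC)

DEFAULT_QUEUE_ID = 0

def allocate_filters(vm_mac, queue_id):
    """Tie a VM's outbound and inbound traffic to the selected queue."""
    outbound_filters[vm_mac] = queue_id
    inbound_filters[vm_mac] = queue_id

def route_incoming(dst_mac):
    # Packets without a matching filter fall through to the default queue.
    return inbound_filters.get(dst_mac, DEFAULT_QUEUE_ID)
```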

The allocation of the filters allows a queue that is conceptually assigned to a non-default pool to serve the traffic of a VM. In other words, by specifying the filter, the system links a VM's traffic with a queue. Moreover, by associating a queue with a conceptual “pool” that it maintains, the queue management system can apply different management processes to differently manage the queues in the different pools, and thereby manage the different VMs with their different requirements. This is further described below in Section II.

When a pool does not yet exist for a queue that is to be allocated to it, the queue management system first defines the pool and then allocates the queue to it, as further described below. The LLR and HLT pools 210 and 215 are two pools that are created to address specific VM requirements. The LLR pool 210 includes the queues that are meant to serve the LLR VMs, while the HLT pool 215 includes the queues that are meant to serve the HLT VMs. As shown in FIG. 2, the LLR VMs 250 and 252 send and receive their packets through the LLR queues 222 and 224 of the LLR pool 210, while the HLT VMs 260 send and receive their packets through the HLT queues 230-238 of the HLT pool 215.

In some embodiments, the queue management system conceptually defines these pools by appropriately allocating the filters (on both the transmit and receive sides), so that the LLR VM traffic goes through one set of queues, while the HLT VM traffic goes through another set of queues. For the LLR VMs, the queue management system of some embodiments optimizes the allocation of the LLR queues to ensure that the VM traffic is spread out as much as possible across the LLR queues, so that the LLR VMs are minimally impacted by the traffic of other VMs. On the other hand, for the HLT VMs, the system of some embodiments optimizes the allocation of the HLT queues by trying to reduce the number of HLT queues that are used by the HLT VMs, in order to keep more of the free queues available for new allocations.

By separating the queues for the LLR VMs from the queues for the HLT VMs, the queue management system of some embodiments allows the traffic to and from the LLR VMs to go through the less congested LLR queues. As such, the LLR VMs can have lower latency in sending and receiving their packets.

Even though the example illustrated in FIG. 2 shows three types of pools, one of ordinary skill will realize that other embodiments use fewer or additional pools. For instance, instead of providing both LLR and HLT pools, the queue management system of some embodiments only defines an LLR pool, and directs all the HLT traffic through the default pool. To handle all such HLT traffic, the queue management system of some embodiments defines multiple queues in the default pool.

In addition to the LLR pool and/or HLT pool, the queue management system of some embodiments also defines LRO and RSS pools (like those illustrated in FIG. 1) to support the LRO and RSS hardware features of the NIC. Also, the LLR and HLT pools are examples of pools that are specified based on VM requirements, with the LLR pool being a higher priority pool than the HLT pool, as the LLR pool is intended for LLR VM traffic. In other embodiments, the queue management system defines more than two priority pools to handle more than two types of VM requirements. For instance, FIG. 3 illustrates an example of three priority pools for three different types of VMs. The pools include a high pool 305, a medium pool 310 and a low pool 315, and their queues respectively handle traffic for high priority (HP) VMs 320, medium priority (MP) VMs 325 and low priority (LP) VMs 330. In some embodiments, a higher priority pool may have fewer VMs and/or less overall traffic per queue than a lower priority pool.

Also, instead of defining an LLR or an HLT pool, or in conjunction with defining such a pool, the queue management system of some embodiments defines a high interrupt (HI) pool or a low interrupt (LI) pool. In this context, interrupts refer to signals generated by the PNIC to threads in the network virtualization layer (NVL) that are responsible for the queues in the PNIC. FIG. 4 illustrates such threads along with the interrupt generation architecture of some embodiments of the invention. In some embodiments, a thread is a process that is initiated to perform a set of tasks (e.g., to manage receive- or transmit-side modules in a network stack for a VM). Also, in some embodiments, different threads can execute as independent processes on different threads of a multi-threaded processor, and/or on different cores of a multi-core processor.

FIG. 4 shows a PNIC 400 that includes (1) several queues for receiving incoming traffic that needs to be relayed to the VMs, (2) a receive-side (RX) processing engine 405 for managing the assignment of incoming traffic to the queues, (3) a queue monitor 410 for monitoring the status of the queues, and (4) an interrupt generator 430 for generating interrupts that direct receive-side threads of the NVL to retrieve data stored in the queues. The RX processing engine includes a MAC filter 420, which, as described above and further described below, is used to pair a VM's incoming traffic to a queue.

FIG. 4 also shows a receive-side (RX) thread 427 for each queue in the PNIC. In some embodiments, the threads 427 are part of a network virtualization layer that manages traffic to and from the virtual machines through the PNIC. The queue management system is part of the network virtualization layer in some embodiments.

Each thread manages its associated queue. Each time a queue is being filled up with received packets, the PNIC's queue monitor 410 detects this and directs the PNIC's interrupt generator 430 to generate an interrupt for the core that executes the queue's thread 425, in order to direct the thread to retrieve the packets from the queue. The generator sends this interrupt through an API of a PNIC driver 435, which in turn generates an interrupt for the core. Each time a queue's thread is invoked for this operation, the core that manages the queue and executes its thread has to interrupt another task that it is performing, in order to execute the thread so that it can retrieve the packets from the queue. Such interruptions affect the processor's operational efficiency.

Accordingly, to increase the processor's operational efficiency and/or reduce latency for critical VMs, the queue management system of some embodiments defines an HI pool or an LI pool. An HI pool is a pool that contains queues carrying traffic that needs to be delivered with lower latency, while an LI pool is a pool that contains queues carrying traffic that can tolerate more latency.

In some embodiments, a thread that manages an HI pool will receive more interrupts than a thread that manages an LI pool, and as such it is operated in some embodiments by a processor core that has less load on it than the core that operates an LI pool. Specifically, to account for the desired low latency of LLR VMs, the queue management system of some embodiments designates a queue that handles traffic for an LLR VM as a queue in an HI pool. Based on this designation, it can then perform a variety of tasks to optimize the management of this queue and the management of the core that executes this queue's thread. For instance, the queue management system in some embodiments reduces the number of VMs that are assigned to this HI queue, or only assigns to this queue VMs that also are critical and need equally low latency. In conjunction with this or instead of it, the queue management system of some embodiments can also direct the processor's scheduler to reduce the load on the core that executes the thread for this HI queue, and/or can direct the PNIC to generate interrupts sooner for this queue.

FIG. 4 illustrates an example of reducing the load on a core that executes the thread of an HI queue. Specifically, in this example, queue management threads TQM1 and TQM2 manage high priority queues HPQ1 and HPQ2, which are HI queues. These threads are assigned to cores 1 and 2 of a multicore processor 450. As shown in FIG. 4, the load on these cores is relatively light, as core 2 only executes thread TQM2, while core 1 executes thread TQM1 and a non-queue management thread TNQM1. The load on these cores is in contrast to the load on core 5, which executes the queue management thread TQM5 (for the low priority queue 5 (LPQ5)) but also executes three other non-queue management threads TNQM2-TNQM4.

To account for the higher acceptable latency of HLT VMs, the queue management system of some embodiments designates a queue that handles traffic for an HLT VM as an LPQ in an LI pool. Based on this designation, it can then perform a variety of tasks to optimize the management of this queue and the management of the core that executes this queue's thread. For instance, the queue management system in some embodiments may assign more VMs to this queue. In conjunction with this or instead of it, the queue management system of some embodiments also notifies the processor's scheduler that it can schedule additional threads onto the core that executes the thread for this queue, and/or directs the PNIC to generate fewer interrupts for this queue (i.e., to allow this queue to fill up more before generating the interrupts).
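
The opposing treatments of HI and LI queues can be summarized in one tuning helper. This is a sketch under assumed interfaces: set_coalescing, limit_extra_threads and allow_extra_threads stand in for whatever PNIC driver and scheduler controls an embodiment actually exposes.

```python
def tune_queue(queue_id, designation, pnic, scheduler, core_id):
    """Apply HI/LI handling; pnic and scheduler are hypothetical handles."""
    if designation == "HI":
        pnic.set_coalescing(queue_id, pkts=1)    # interrupt sooner
        scheduler.limit_extra_threads(core_id)   # keep the core lightly loaded
    elif designation == "LI":
        pnic.set_coalescing(queue_id, pkts=64)   # let the queue fill up more
        scheduler.allow_extra_threads(core_id)   # core may take extra threads
    else:
        raise ValueError(designation)
```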

The HI pool and/or LI pool designations are used in conjunction with the LLR pool and/or HLT pool designations in some embodiments, while they are used in place of the LLR pool and/or HLT pool designations in other embodiments. A queue may be designated as both an LLR and an HI queue in some embodiments. Alternatively, an LLR VM might be included with HLT VMs in a queue, but the queue might be designated as an HI queue, so that its core is not as heavily loaded and can therefore be interrupted frequently to empty out the queue.

II. Queue Management System

The queue management system of some embodiments will now be described by reference to FIG. 5. This system 500 breaks up the queues into different priority pools, with the higher priority pools reserved for particular types of traffic or VMs (e.g., traffic for VMs that need low latency). It also dynamically adjusts the queues in each pool (i.e., dynamically adjusts the size of the pools), and dynamically reassigns a VM to a new queue in its pool based on one or more optimization criteria (e.g., criteria relating to the underutilization or overutilization of the queue).

FIG. 5 illustrates (1) several VMs 505 that are executing on a host, (2) the host's physical NIC 515 that is shared by the VMs, (3) a network virtualization layer 510 that executes on the host and that facilitates traffic to and from the VMs through the shared PNIC, and (4) a physical processor scheduler 525 (also called physical CPU or PCPU), which is a kernel scheduler that directs the processors as to when and where to run one of the threads (also called contexts).

The PNIC 515 has several queues 517. These queues include receive side queues for storing incoming data received by the host and transmit side queues for storing outgoing data transmitted from the VMs. In some embodiments, each queue includes a set of buffers for storing incoming or outgoing data. In some embodiments, the receive side queues are separate and independent from the transmit side queues, but the virtualization layer pairs one receive side queue with one transmit side queue so that the queue pair can be used as one queue construct for a VM. Other embodiments, however, do not “pair” the queues. In other words, these embodiments do not require all the VMs that use a receive side queue to use the same transmit side queue; two VMs can use the same receive side queue but different transmit side queues.

The PNIC also has a receive (RX) side processing engine 511 for receiving incoming packets from a wired or wireless link. The RX processing engine has a MAC filter 514, which is configured to associate each VM's incoming traffic to one queue pair based on the destination MAC. The virtualization layer maintains an analogous filter 516 for outgoing packets, and a queue selector 518 in this layer uses the data in this filter to configure each VM's outgoing traffic to use the same queue pair as the incoming traffic. In some embodiments, the filter 516 specifies a VM in terms of the VM's or its VNIC's source MAC address, while in other embodiments it specifies a VM in terms of the port ID of a software forwarding element to which the VM's VNIC connects. In some embodiments, the PNIC also includes circuitry for monitoring the queues and generating interrupts, as described above by reference to FIG. 4.

The VMs execute on top of a hypervisor (not shown), which, in some embodiments, includes the network virtualization layer 510. FIG. 5 shows each VM to include a virtual NIC (VNIC) 507. It also shows the network virtualization layer 510 to include (1) one network stack 550 for each VM, (2) a software forwarding element 535, (3) a statistics-gathering engine 540, (4) a statistics storage 545, and (5) a dynamic load balancer 555. Each network stack includes a VNIC emulator 527 and an I/O chain 529. Each network stack is managed by receive/transmit threads 531.

Each network stack connects to its VM through its VNIC emulator and connects to the software forwarding element 535, which is shared by all the network stacks of all the VMs. Each network stack connects to the software forwarding element through a port (not shown) of the switch. In some embodiments, the software forwarding element maintains a single port for each VNIC. The software forwarding element 535 performs packet-processing operations to forward packets that it receives on one of its ports to another one of its ports, or to one of the ports of another software forwarding element that executes on another host. For example, in some embodiments, the software forwarding element tries to use data in the packet (e.g., data in the packet header) to match a packet to flow-based rules, and upon finding a match, performs the action specified by the matching rule.

In some embodiments, software forwarding elements executing on different host devices (e.g., different computers) are configured to implement different logical forwarding elements (LFEs) for different logical networks of different tenants, users, departments, etc. that use the same shared compute and networking resources. For instance, two software forwarding elements executing on two host devices can perform L2 switch functionality. Each of these software switches can in part implement two different logical L2 switches, with each logical L2 switch connecting the VMs of one entity. In some embodiments, the software forwarding elements provide L3 routing functionality, and can be configured to implement different logical routers with the software L3 routers executing on other hosts.

In the virtualization field, some refer to software switches as virtual switches, as these are software elements. However, in this document, the software forwarding elements are referred to as physical forwarding elements (PFEs), in order to distinguish them from logical forwarding elements, which are logical constructs that are not tied to the physical world. In other words, the software forwarding elements are referred to as PFEs because they exist and operate in the physical world, whereas logical forwarding elements are simply a logical representation of a forwarding element that is presented to a user. Examples of logical forwarding elements include logical switches, logical routers, etc. U.S. patent application Ser. No. 14/070,360 provides additional examples of PFEs and LFEs, and is incorporated herein by reference.

The software forwarding element 535 connects to the PNIC to send outgoing packets and to receive incoming packets. In some embodiments, the software forwarding element is defined to include a port through which it connects to the PNIC to send and receive packets. As mentioned above, the queue selector 518 is interposed between the software forwarding element 535 and the PNIC in some embodiments. The queue selector selects the receive side queues for retrieving incoming packets and the transmit side queues for supplying outgoing packets. As mentioned above, the queue selector uses the data in the filter 516 to identify the transmit side queue for supplying a particular VM's outgoing traffic. The selector does not use the data in the filter to select a queue and retrieve its packets for an RX thread of a VM. In some embodiments, the queue selector is part of the receive/transmit threads 531 of the network stacks, as further described below. As such, for these embodiments, the queue selector 518 is a conceptual representation of the queue selection operation that the receive/transmit threads 531 perform in some embodiments.

Each VNIC in a VM is responsible for exchanging packets between the VM and the network virtualization layer through its associated VNIC emulator 527. Each VNIC emulator interacts with NIC drivers in the VMs to send and receive data to and from the VMs. In some embodiments, the VNICs are software abstractions of physical NICs implemented by virtual NIC emulators. For instance, the code for requesting and obtaining a connection ID resides in components of the virtual NIC emulators in some embodiments. In other words, the VNIC state is implemented and maintained by each VNIC emulator in some embodiments. Virtual devices such as VNICs are software abstractions that are convenient to discuss as though part of the VMs, but they are actually implemented by virtualization software using emulators. The state of each VM, however, includes the state of its virtual devices, which is controlled and maintained by the underlying virtualization software. Even though FIG. 5 shows one VNIC emulator for each VNIC of each VM, each VNIC emulator may maintain the state for more than one VNIC and/or for more than one VM in some embodiments.

The I/O chain in each network stack includes a series of modules that perform a series of tasks on each packet. As described in the above-incorporated U.S. patent application Ser. No. 14/070,360, two examples of I/O chain modules are ARP and DHCP proxy modules that resolve ARP and DHCP broadcast messages without resorting to broadcasting these messages. Other examples of the processes performed by the modules in the I/O chain include firewall and traffic tunneling operations. The input/output of the I/O chain goes to one of the ports of the software forwarding element.

In some embodiments, the receive/transmit threads 531 of each network stack 550 are kernel-level threads that manage the modules in the network stack. These threads also manage the PNIC queue 517 that is associated with the stack's VM. Specifically, in some embodiments, the receive side of each queue has a dedicated RX kernel thread to handle interrupts and poll packets from the receive side of the queue. Also, each VM has a dedicated TX kernel thread to handle packets sent from the VM. In some embodiments, each pair of receive/transmit threads is executed by one of the cores of the multi-core processor(s) of the host, as the recommended number of queues in these embodiments equals the number of cores of the multi-core processor(s) of the host. Even though separate receive and transmit threads are used in FIG. 5 for separately managing the receive and transmit operations of the stack and its associated queue, one of ordinary skill will realize that in other embodiments one thread is used to perform both of these tasks. Also, in some embodiments, the RX/TX thread(s) may not be tied, or may not be as strictly tied, to the queues, cores and/or VMs.

As mentioned above, the network virtualization layer also includes the statistics (stat) gathering engine 540, the stat storage 545, and the dynamic load balancer 555. The stat gathering engine 540, the load balancer 555 and the RX/TX threads 531 form in part the queue management system of some embodiments. The statistics that are gathered by the stat gathering engine 540 provide the load balancer with the information that it needs to determine when to assign queues to pools and when to adjust pools.

The stat gathering engine gets stats from different sources in different embodiments. For instance, in some embodiments, this engine pulls stats or receives pushed stats from either the CPU scheduler 525 (for CPU utilization) or the RX/TX threads (for network traffic). For the network traffic, the network virtualization layer has stats (such as throughput, packet rate, packet drops, etc.) gathered from a variety of sources, including each layer of the network stacks (i.e., each module managed by the RX/TX threads).

In some embodiments, the stats gathering engine gathers the following network stats for the load balancer: PNIC packet rate, PNIC throughput, and the CPU utilization of each of the RX/TX threads. In some embodiments, the CPU scheduler 525 updates the CPU utilization data, while the RX/TX threads update the PNIC packet rate and throughput, since they are the threads that actually communicate with the PNIC and have the exact counts. In some embodiments, a PNIC driver module is below the queue selector, and this PNIC driver is the module that communicates with the PNIC and updates the PNIC load statistics. Also, in some embodiments, the stats gathering engine not only gathers the PNIC statistics for the load balancer, but also gathers VNIC stats collected by the VNIC emulators.
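
The statistics the load balancer consumes can be grouped into records like the following; the field names are assumptions that simply mirror the metrics listed above.

```python
from dataclasses import dataclass

@dataclass
class QueueStats:
    queue_id: int
    pnic_packet_rate: float   # packets/s, updated by the RX/TX threads
    pnic_throughput: float    # bytes/s, updated by the RX/TX threads
    thread_cpu_load: float    # CPU utilization of the queue's RX/TX thread

@dataclass
class VnicStats:
    vm_id: int
    packet_rate: float        # per-VNIC rate, from the VNIC emulator
```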

By relying on VNIC stats, the load balancer can decide to move a latency-sensitive VM to an exclusive queue when its VNIC packet rate rises above some threshold at which it might start hurting whichever VMs share the same queue with it. More generally, the load balancer 555 uses the gathered stats to determine which queues to assign to which VMs, when to dynamically assign queues to pools, and when to dynamically adjust pools.

In some embodiments, the load balancer periodically (e.g., every few seconds, every few milliseconds, every few microseconds, etc.) runs a load balancing process. This process pulls stats from the “load stats” data storage 545 that the stat gathering engine maintains, and based on these stats, determines whether it needs to allocate pools, to de-allocate pools, to assign VMs to queues, to resize pools, and/or to preempt queues. In some embodiments, the load balancer assigns VMs to queues by configuring the filters of the PNIC and the virtualization layer to associate a particular queue identifier with a particular source MAC address for outgoing traffic and a particular destination MAC address for incoming traffic. To configure the MAC filters of the PNIC, the load balancer uses APIs of the PNIC driver to program filters and hardware features for each queue.
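
Put together, the periodic pass might look like the skeleton below. The helper names simply mirror the steps in the text and are assumptions, not functions of any described embodiment.

```python
import time

def run_load_balancer(stats_storage, period_seconds, stop_event):
    while not stop_event.is_set():
        stats = stats_storage.snapshot()      # pull from the "load stats" store
        assign_vms_to_queues(stats)           # default <-> non-default moves
        rebalance_within_pools(stats)         # per-pool optimization criteria
        rebalance_across_pools(stats)         # resize pools, preempt queues
        time.sleep(period_seconds)
```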

As shown in FIG. 5, the load balancer has three modules, which are the pools 561, the queue balancer 559 and the pool balancer 557. Pools are a software abstraction: a grouping of PNIC queues that the load balancer defines. The load balancer applies different processes to manage queues in different “pools.” As such, each pool can be viewed as a set of queues that have the same “feature,” where a feature is analogous to a hardware feature (like RSS/LRO). Examples of such features include VM requirements (such as low-latency or low-interrupt-rate).

By applying different processes to manage queues in different pools, the load balancer can optimize the allocation of queues and the resizing of the pools differently for different pools. The pool rebalancer 557 resizes each pool based on the pool's resource allocation criteria and preempts queues from other pools when necessary. Examples of such resource allocation criteria include the maximum/minimum number of queues of the pool, the total CPU utilization of the pool, the network traffic of the pool, the quality of service (QoS) constraints of the pool, etc. The queue rebalancer 559 rebalances the queues in the same pool based on the pool's rebalancing criteria. Examples of such pool rebalancing criteria include packing VMs onto as few queues as possible (e.g., for an HLT pool), distributing the VMs across as many queues as possible (e.g., for an LLR pool), etc. In some embodiments, two different processes that manage two different pools specify different resource allocation criteria, different preemption criteria, different rebalancing criteria, etc.

III. Adjusting VM Allocations and Adjusting Pools

FIG. 6 conceptually illustrates an overall process 600 that the load balancer 555 performs in some embodiments. The load balancer 555 in some embodiments performs this process periodically (e.g., every few seconds, every few milliseconds, every few microseconds, etc.) to assign VMs to queues, to rebalance queues within each pool, and to maintain the desired balance across the pools.

In some embodiments, the process 600 starts when a timer (e.g., an N-second timer) expires. As shown in FIG. 6, the process 600 initially invokes (at 605) a queue assignment process that examines the VMs in the default pool to identify any VM that it has to move to a non-default pool, and moves any identified VM to the appropriate non-default queue in the non-default pool. In some embodiments, the queue assignment process moves (at 605) a VM to a non-default queue when the VM's use of a default queue exceeds a threshold level for the default queue or for the VM's use of the default queue. At 605, the process also identifies any VM in a non-default queue that has to move back to the default pool, and moves any identified VM back to the default pool. In some embodiments, the queue assignment process moves (at 605) a VM back to the default pool when the VM's use of its non-default queue is below a threshold level for the non-default queue or for the VM's use of the non-default queue. The queue assignment process of some embodiments is further described below by reference to FIG. 7.

After 605, the process 600 invokes (at 610) a pool adjustment process to rebalance queues within each pool. In some embodiments, the pool adjustment process examines each pool to determine whether it has to move one or more VMs between queues in the pool or to new queues in the pool, based on one or more optimization criteria for the pool. The pool adjustment process of some embodiments uses different optimization criteria for different pools. For instance, in some embodiments, the optimization criteria for an LLR pool bias the process to distribute the VMs between the queues of the LLR pool, while the optimization criteria for an HLT pool bias the process to aggregate the VMs onto fewer queues in the HLT pool. Based on these criteria and its determinations at 610, the process 600 re-assigns (at 610) VMs between queues in a pool or to a new queue in the pool. The pool adjustment process of some embodiments is further described below.

Next, at 615, the process 600 invokes a pool balancing process that maintains the desired balance across the pools. In some embodiments, the pool balancing process examines the utilization of queues across the various pools. Based on this examination, the balancing process may allocate one or more queues to one pool. It may also de-allocate one or more queues from another pool based on this examination. In one invocation, this process may allocate more queues to more than one pool, or it might de-allocate queues in more than one pool. The rebalancing process across pools is further described below.

One of ordinary skill will realize that the load balancing process 600 is different in other embodiments. For instance, in some embodiments, the process 600 does not have a separate rebalancing operation 615, but rather performs this operation implicitly or explicitly as part of the operations 605 and 610. Also, while certain sub-operations are explained above and below as being part of one of the operations 605, 610, and 615, one of ordinary skill will realize that these sub-operations can be performed in different ones of these operations 605, 610, or 615, or as different operations on their own or as sub-operations of different operations.

After 615, the process 600 ends.

FIG. 7 conceptually illustrates a queue assignment process 700 that the load balancer 555 invokes periodically (e.g., every few seconds, every few milliseconds, every few microseconds, etc.) in some embodiments. The load balancer 555 in some embodiments invokes this process periodically to identify and move any VM that is overutilizing the default queue or underutilizing a non-default queue. As shown in FIG. 7, the process 700 initially gathers (at 705) statistics from the stats storage 545 regarding all the VMs' usage of the default queue and the non-default queues. The process uses the retrieved statistics to perform its analysis, as further described below.

Next, based on the retrieved statistics, the process identifies (at 710) any VM using the default queue in the default pool that is currently exceeding a threshold usage level for the default queue or for the VM's use of the default queue (e.g., when different VMs have different threshold usage levels for the default queue). As mentioned above, some embodiments assign each VM to the default queue when the VM is initialized, but monitor each VM's usage of the default queue and move the VM to a non-default queue when the VM's usage exceeds a threshold value.

At 715, the process determines whether it was able to identify any VM at 710. If not, the process transitions to 765, which is further described below. Otherwise, the process transitions to 720 to select one of the VMs identified at 710 and to identify the selected VM's requirements for joining a queue of a pool. As mentioned above, some embodiments of the invention define one or more non-default pools of queues that meet one or more differing requirements of different sets of VMs.

Next, at 725, the process determines whether it has previously defined a pool for the selected VM's requirements. For instance, assuming that the selected VM is an LLR VM, the process determines (at 725) whether it has previously defined an LLR pool to which it assigns LLR VMs. When the process determines (at 725) that it has previously defined the pool for the VM's requirements, it then determines (at 730) whether it can assign the selected VM to one of the queues in this previously defined pool. In other words, at 730, the process determines whether the existing queues of the pool have sufficient available capacity for the VM selected at 720.

When a queue in the pool has sufficient capacity, the process assigns (at 735) the VM to this queue and to this queue's pool, and then transitions to 740, which is described below. As mentioned above, some embodiments create the association between a VM and a queue through filtering, which uses the source MAC address to tie a VM's outgoing traffic to a particular queue and the destination MAC address to tie its incoming traffic to the particular queue. Some embodiments explicitly specify the association between the VM and the pool, while other embodiments implicitly specify this association through the association between the VM's assigned queue and the pool. The association between the VM and the pool that is created at 735 allows the load balancer to apply a common set of processes to manage the VM on its queue along with the other VMs on this and other queues in the same pool. As mentioned above, this set of processes is different from the set of processes used to manage other VMs in other pools of queues in some embodiments.

When the process determines (at 725) that a pool does not exist for the selected VM's requirement (e.g., an LLR requirement), the process specifies (at 745) a pool for the selected VM's requirement (e.g., specifies an LLR pool), and then transitions to 750. The process also transitions to 750 when it determines (at 730) that the previously specified pool for the VM's requirement does not have any queue with sufficient capacity for the selected VM.

At 750, the process determines whether there is any PNIC queue that is currently unassigned to any VM (i.e., whether there is any queue in the free pool of queues). If so, the process (at 755) selects one of the free queues, assigns it to the pool, assigns the selected VM to this queue and the pool, and then transitions to 740, which is described below. Otherwise, the process preempts (at 760) one of the queues used by another one of the non-free pools. Preemption involves first reassigning the VMs that are using the preempted queue to other queues in the pool that includes the preempted queue. In some embodiments, the process will not be able to preempt a queue from another pool in some cases, because the current non-default pool of the VM has a lower priority than the other non-default pools. Once all the VMs have been reassigned and the queue has processed all of the traffic for such VMs, the process assigns (at 760) the preempted queue to the pool for the selected VM's requirement. The process also assigns (at 760) the selected VM to this queue and the pool, and then transitions to 740.

At 740, the process determines whether it has processed all of the VMs identified at 710. If not, the process returns to 720 to select another identified VM. Otherwise, after the process has assigned each VM identified at 710 to a pool and queue, the process determines (at 765) whether any VM that is in a non-default queue should be moved back to the default pool, based on the statistics retrieved at 705. In some embodiments, the process moves a VM back to a default queue when the VM's usage of a non-default queue falls below a threshold usage level for the non-default queue or for the VM's use of the non-default queue. The process in some of these embodiments only moves the VM when its usage has been below the threshold usage level for a sufficiently long period of time. When the process identifies any such VM at 765, it moves the identified VM back to a default queue of the default pool, and removes this VM from the pool of its previously assigned non-default queue. When no other VM uses the previously assigned non-default queue after this re-assignment, the process 700 also re-allocates (at 765) the non-default queue to the pool of free queues. When this re-allocated queue is the last queue of a pool, the process 700 of some embodiments also de-allocates the pool, as it no longer contains any queues. Other embodiments, however, do not de-allocate the pool in such circumstances. After 765, the process ends.

FIG. 8 conceptually illustrates a pool adjustment process 800 that the load balancer 555 invokes periodically (e.g., every few seconds, every few milliseconds, every few microseconds, etc.) in some embodiments. The load balancer 555 in some embodiments invokes this process periodically to rebalance queues within each pool. In some embodiments, the pool adjustment process examines each pool to determine whether it has to move one or more VMs between queues in the pool or to new queues in the pool, based on one or more optimization criteria for the pool. The pool adjustment process of some embodiments uses different optimization criteria for different pools. For instance, in some embodiments, the optimization criteria for an LLR pool bias the process to distribute the VMs between the queues of the LLR pool, while the optimization criteria for an HLT pool bias the process to aggregate the VMs onto fewer queues in the HLT pool.

As shown in FIG. 8, the process 800 initially gathers (at 805) statistics from the stats storage 545 regarding all the VMs' usage of the default queue and the non-default queues. The process uses the retrieved statistics to perform its analysis, as further described below. After 805, the process 800 selects (at 810) one of the non-default pools to examine (e.g., selects the LLR pool to examine). Next, at 815, the process determines whether any queue in the pool is underutilized. If it finds any such queue, the process then determines (at 815) whether any other queue in the selected pool has the capacity for the VM or VMs that are currently using the underutilized queue that it identified at 815. When the process identifies such queues with excess capacity, the process assigns (at 815) the VM or VMs from the underutilized queue to the queue or queues with excess capacity. The process also de-allocates (at 815) the underutilized queue from the selected pool (i.e., assigns the underutilized queue to the free pool) when the underutilized queue does not have any other VM assigned to it after the move. In this manner, an underutilized queue can be freed up for allocation to another pool, or to the same pool at a later time.

After 815, the process identifies (at 820) any queue that is being overutilized in the selected pool. The process then determines (at 825) whether the selected pool has any queue with excess capacity to handle the traffic of one or more of the VMs that are currently assigned to the identified overutilized queue. When the process identifies (at 825) one or more queues with excess capacity, the process assigns (at 830) one or more of the VMs that are currently assigned to the identified overutilized queue to one or more queues with excess capacity, and then transitions to 835, which will be described below.

On the other hand, when the process determines (at 825) that the selected pool does not have any excess-capacity queues to handle some of the traffic that is currently going through the overutilized queue, the process determines (at 840) whether there are any free queues (e.g., whether the free pool has any queues). If so, the process allocates (at 845) one or more of the free queues to the selected pool (i.e., the pool selected at 810). At 845, the process also assigns one or more of the VMs that are currently assigned to the identified overutilized queue to the newly allocated queue(s), and then transitions to 835, which will be described below.

When the process determines (at 840) that there are no free queues, the process determines (at 850) whether it can preempt a queue from another pool. In some embodiments, not all pools can preempt queues from other pools; only some pools (e.g., the LLR pool) can preempt queues from other pools (e.g., the HLT pool). Also, in some embodiments, some pools can preempt queues from other pools only under certain circumstances (e.g., when the other pool is not heavily overloaded itself).

When the process determines (at 850) that it cannot preempt a queue from another pool, it transitions to 835. On the other hand, when the process determines (at 850) that it can preempt a queue from another pool, the process assigns (at 855) all the VMs that are currently using the preempted queue to new queues within the same pool as the preempted queue. After re-assigning all of the VMs, the process then allocates (at 855) the preempted queue to the selected pool (i.e., the pool selected at 810), assigns one or more of the VMs that are currently assigned to the identified overutilized queue to the newly allocated queue(s), and then transitions to 835.

At 835, the process determines whether it has examined all the non-default pools that it should examine. If so, it ends. Otherwise, the process returns to 810 to select another non-default pool. The process uses different criteria to assess underutilization and overutilization of queues for different pools. For instance, for an LLR pool, the overutilization threshold might be a 50% load and the underutilization threshold might be 5%, while for an HLT pool, the overutilization threshold might be 90% and the underutilization threshold might be 75%.
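
One plausible way to encode these per-pool thresholds is shown below; the numbers simply mirror the examples in the preceding paragraph and are not normative.

POOL_THRESHOLDS = {
    "LLR": {"over": 0.50, "under": 0.05},
    "HLT": {"over": 0.90, "under": 0.75},
}

def is_overutilized(pool_name, queue_load):
    return queue_load > POOL_THRESHOLDS[pool_name]["over"]

def is_underutilized(pool_name, queue_load):
    return queue_load < POOL_THRESHOLDS[pool_name]["under"]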

Also, as mentioned above, some embodiments resize different pools based on different pool resource allocation criteria, such as max/min number of queues of the pool, total CPU utilization of the pool, network traffic of the pool, quality of service (QoS) constraints of the pool, etc. Similarly, some embodiments rebalance the queues in the different pools based on different rebalancing criteria, such as packing VMs on as few queues as possible (e.g., for an HLT pool), distributing VMs across as many queues as possible (e.g., for an LLR pool), etc.

In some embodiments, the process 800 of FIG. 8 examines underutilized queues in order to re-assign their VMs to other queues in the same pool so that it can de-allocate the underutilized queues. In addition to, or instead of, examining the underutilized queues, the load balancer of some embodiments examines each VM that is assigned to a non-default queue to determine whether the VM's use of its non-default queue is below a threshold level. FIG. 9 illustrates a process 900 that is performed by the load balancer of some embodiments to assess a VM's utilization of its queue.

As shown in this figure, the process 900 starts by selecting (at 905) a VM of a non-default queue. Next, at 910, the process determines whether the VM is underutilizing its queue. For instance, in some embodiments, a VM is deemed to underutilize its queue when only 1% of the traffic through the queue is attributed to the VM.

When the process determines (at 910) that the VM is not underutilizing its queue, the process ends. Otherwise, the process determines (at 915) whether another queue in the same pool as the current queue of the VM has capacity for the selected VM. If not, the process ends. Otherwise, the process (at 920) assigns the VM to a new queue in the same pool as the current queue of the VM. The process then de-allocates (at 925) the previous queue of the VM if the previous queue does not have any other VM assigned to it.
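
A compact rendering of process 900 follows, under the same illustrative assumptions as the earlier sketches; vm_share is an assumed callback that returns the fraction of a queue's traffic attributable to a given VM.

def rebalance_underusing_vm(vm, queue, pool_queues, free_pool,
                            vm_share, load, under=0.01, over=0.90):
    if vm_share(queue, vm) >= under:       # 910: VM is not underutilizing
        return
    target = next((t for t in pool_queues  # 915: find a queue with capacity
                   if t is not queue and load(t) < over), None)
    if target is None:
        return
    queue.vms.discard(vm)                  # 920: move the VM
    target.vms.add(vm)
    if not queue.vms:                      # 925: de-allocate the emptied queue
        pool_queues.remove(queue)
        free_pool.append(queue)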

FIG. 10 illustrates an example of re-assigning a VM from a first queue to a second queue because of the underutilization of the first queue or because of the VM's underutilization of the first queue. Specifically, this figure illustrates the network stacks 1005 and 1010 of two VMs 1015 and 1020 that are initially assigned to two queues 1025 and 1030. These two VMs are assigned to these two queues upon the initialization of the VMs.

At some later point in time, the load balancer 555 retrieves statistics regarding the various VMs' use of the various queues. Based on these statistics, the load balancer detects that the VM 1020 uses less than 1% of the capacity of the queue 1030. As this VM is underutilizing its queue, the load balancer then defines filters in the network virtualization layer and in the PNIC that re-assign the VM 1020 and its network stack 1010 from the queue 1030 to the queue 1025. If no other VM is using the queue 1030, the load balancer will also re-allocate the queue 1030 to the free pool of queues. In some embodiments, the load balancer would re-assign the VM 1020 to the queue 1025 only if it determines that upon this re-assignment (and the re-assignment of any other VMs that are concurrently using the queue 1030), the queue 1030 could be freed from any VM traffic and hence re-allocated to the free pool. In other embodiments, however, the load balancer makes its re-allocation decision based solely on the VM's own usage of its queue.

FIG. 11 illustrates an example of re-assigning a VM from a first queue to a second queue because of the overutilization of the first queue. Specifically, this figure illustrates the same two network stacks 1005 and 1010 of the same two VMs 1015 and 1020 after both VMs have been assigned to the queue 1025. At some point in time, the load balancer 555 retrieves statistics regarding the various VMs' use of the various queues. Based on these statistics, the load balancer detects that the queue 1025 is handling traffic at 90% of its capacity. As this queue is being overutilized, the load balancer then defines filters in the network virtualization layer and in the PNIC that re-assign the VM 1020 and its network stack 1010 from the queue 1025 to the queue 1030.

As mentioned above, the load balancer in some embodiments is a pool balancing process that maintains the desired balance across the pools. In some embodiments, the pool balancing process examines the utilization of queues across the various pools. Based on this examination, the balancing process may allocate one or more queues to one pool. It may also de-allocate one or more queues from another pool based on this examination. In one invocation, this process may allocate more queues to more than one pool, or it might de-allocate queues in more than one pool.

FIG. 12 illustrates an example of pool balancing across the pools. Specifically, it shows the addition of a queue from a low priority pool 1215 to a high priority pool 1220 in two stages 1205 and 1210. In the first stage, the load balancer 555 retrieves statistics from the statistics storage 545. Based on the retrieved statistics, the load balancer determines that the load (LH) through the queues of the high priority pool 1220 is more than a certain percentage (e.g., 75%) of the load (LL) through the queues of the low priority pool 1215.

As some embodiments want the load through the high priority queues to be substantially less than the load through the low priority queues, the load balancer sets the filters so that a queue 1250 from the low priority pool is removed from this pool and added to the high priority pool. Before the queue 1250 can switch to the high priority pool, the VMs that were using the queue 1250 in the low priority pool have to be assigned to different queues in this pool. FIG. 12 illustrates that after the addition of the queue 1250 to the high priority pool, the load (LH) through the high priority queues is less than the specified percentage of the load (LL) through the low priority queues.
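
The balance test of FIG. 12 reduces to a simple ratio check, sketched below with an assumed 75% ceiling.

def needs_queue_from_low_pool(high_load, low_load, max_ratio=0.75):
    """True when the high priority load (LH) exceeds max_ratio times the
    low priority load (LL), i.e., when a queue should migrate pools."""
    return low_load > 0 and high_load > max_ratio * low_load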

IV. Non-VM Addressable Nodes and Other Filtering

Several embodiments have been described above that, for data traffic to or from one or more VMs executing on a host device, dynamically define pools of queues, uniquely manage each pool, dynamically modify queues within a pool, and dynamically re-assign a VM's traffic to a new queue. Many of these embodiments use the destination or source MAC addresses of the packets received or transmitted by the host device to assign the VM data traffic packets to the different pools and the different queues within the pools.

However, not all embodiments use MAC addresses to assign data traffic to pools and queues within pools. Also, the queue management methods and apparatuses of some embodiments are used for data traffic other than VM data traffic. Specifically, in addition to the VM data traffic or instead of the VM data traffic, some embodiments dynamically define pools, uniquely manage each pool, dynamically modify the queues within the pools, and dynamically re-assign data traffic to and from non-VM addressable nodes (e.g., source end nodes or destination end nodes) that execute on a host. The methods and apparatuses of some embodiments are used to perform these operations to differentiate the routing of different types of data traffic through the queues.

Several such embodiments are further described below. Specifically, sub-section A describes several embodiments that use MAC address filtering to route non-VM data traffic to different queues of different pools. Sub-section B then describes several embodiments that use five-tuple IP filtering to route different types of data traffic to different queues of different pools.

A. MAC Filtering for Non-VM Traffic

Some embodiments use MAC address filtering to route data traffic of non-VM addressable nodes executing on a host device to different pools of queues, and to different queues within the pools. For instance, the method of some embodiments monitors data traffic for a set of non-VM addressable nodes (e.g., data end nodes) through the physical NIC of a host device. Based on this monitoring, the method specifies a pool for at least a set of the non-VM addressable nodes, and assigns a set of the queues to the pool. The method then uses destination or source MAC filtering to direct to the assigned set of queues the data traffic that is received by, or transmitted from, the host device for the set of non-VM addressable nodes.
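
The table below illustrates the kind of destination-MAC-to-queue mapping such filtering programs into the PNIC; the MAC addresses and queue identifiers are invented for the example and do not come from any embodiment.

mac_filters = {
    "00:50:56:aa:bb:01": 3,   # e.g., an iSCSI mounter -> queue pair 3
    "00:50:56:aa:bb:02": 3,   # e.g., an NFS mounter   -> queue pair 3
    "00:50:56:aa:bb:03": 5,   # e.g., a VM migrator    -> queue pair 5
}

def queue_for_packet(dst_mac, default_queue=0):
    """Return the queue pair for an incoming packet's destination MAC,
    falling back to the default queue when no filter matches."""
    return mac_filters.get(dst_mac.lower(), default_queue)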

Alternatively, or conjunctively, based on the monitoring, the method can modify the set of queues assigned to a pool for the set of the non-VM addressable nodes. Examples of such modifications include adding or removing a queue from the pool when one or more of the queues of the pool are overutilized or underutilized. In some embodiments, the method adds a queue to the pool by preempting a queue from another pool, e.g., by using one of the above-described preemption methodologies.

Also, alternatively or conjunctively to the above-described operations, the method can re-assign the data traffic for a non-VM addressable node (e.g., a data end node) from a first queue in the pool to a second queue in the pool, based on the monitoring. For instance, based on the monitoring, the method of some embodiments detects that the traffic for the non-VM addressable node through the first queue falls below a minimum threshold amount of traffic (e.g., for a duration of time). Because of this underutilization, the method switches this traffic to the second queue. Before making this switch, the method of some embodiments determines that the traffic through the second queue does not exceed a maximum threshold amount of traffic.

Based on the monitoring, the method of some embodiments detects that the traffic through the first queue exceeds a maximum threshold amount of traffic (e.g., for a duration of time). Because of this overutilization, the method switches the traffic for a non-VM addressable node (e.g., a data end node) from the first queue to the second queue. Again, before making this switch, the method of some embodiments determines that the traffic through the second queue does not exceed a maximum threshold amount of traffic.

FIG. 13 illustrates a queue management system 1300 of some embodiments that uses MAC address filtering to route data traffic of VMs and non-VM addressable nodes executing on a host device to different pools of queues, and to different queues within the pools. This system is similar to the system 500 of FIG. 5, except that the system 500 only manages traffic to and from VM end nodes, while the system 1300 manages traffic to and from VM and non-VM addressable nodes, such as an iSCSI (internet small computer system interface) mounter 1305, an NFS (network file system) mounter 1307, and a VM migrator 1309. The system 1300 breaks up the queues into different priority pools, with the higher priority pools reserved for particular types of addressable nodes. It also dynamically adjusts the queues in each pool (i.e., dynamically adjusts the size of the pools), and dynamically reassigns an addressable node to a new queue in its pool based on one or more optimization criteria (e.g., criteria relating to the underutilization or overutilization of the queue).

FIG. 13 illustrates (1) several VMs 505 executing on a host (not shown), (2) two mounted storage volumes 1320 and 1325, (3) VM migration data 1330, (4) the host's physical NIC 515 that is shared by the VM and non-VM nodes, and (5) a virtualization layer 1310 that facilitates traffic to and from the VMs through the shared PNIC. As further shown, the virtualization layer includes several non-VM addressable nodes, such as the iSCSI mounter 1305, the NFS mounter 1307, and the VM migrator 1309. This layer also includes a network stack 1311, 1313, 1315, 1317, or 1319 for each VM or non-VM addressable node. The virtualization layer also includes a software forwarding element 535 (e.g., a software switch), a queue selector 518, a dynamic load balancer 555, a statistic gathering engine 540, and a statistic storage 545.

In some embodiments, the PNIC 515 of FIG. 13 is identical to the above-described PNIC 515 of FIG. 5. As mentioned above, the PNIC has a receive side (RX) processing engine 511 for receiving incoming packets from a wired or wireless link. The RX processing engine has a MAC filter 514, which is configured to associate each (VM or non-VM) addressable node's incoming traffic with one queue pair based on the destination MAC. The virtualization layer maintains an analogous filter 516 for outgoing packets, and a queue selector 518 in this layer uses the data in this filter to configure each addressable node's outgoing traffic to use the same queue pair as the incoming traffic. In some embodiments, the filter 516 specifies an addressable node in terms of the VM's or its VNIC's source MAC address, while in other embodiments it specifies a VM in terms of the port ID of a software forwarding element to which the VM's VNIC connects. As the PNIC 515 of FIG. 13 is identical to the PNIC 515 of FIG. 5, it will not be described further, in order not to obscure the description of FIG. 13 with unnecessary detail.

The VM and non-VM addressable nodes execute on top of a hypervisor (not shown), which, in some embodiments, includes the virtualization layer 1310. The VM and non-VM addressable nodes can be source and destination end nodes for packets that are transmitted through the network. As mentioned above, these nodes include the VMs 505, the volume mounters 1305 and 1307, and the VM migrator 1309. The iSCSI mounter 1305 mounts a storage volume 1320 on the host. This storage volume 1320 is some or all of an external storage (i.e., a storage external to the host, such as a storage server) that is accessible through the iSCSI protocol. Similarly, the NFS mounter 1307 mounts a storage volume 1325 on the host. This storage volume 1325 is some or all of an external storage (e.g., a storage server) that is accessible through the NFS protocol. The mounted volumes can then be accessed by the modules (e.g., VMs) executing on the host or by other devices, as if the external storages resided on the host. The VM migrator 1309 gathers data about each VM executing on the host to facilitate the live migration of a VM from one host to another. One example of such a VM migrator is the vMotion module used in the ESX hypervisor of VMware Inc.

Each addressable node connects to the software forwarding element 535 through a network stack and a port (not shown) of the forwarding element. In some embodiments, each VM's network stack includes a VNIC emulator 527 and an I/O chain 529, and is managed by receive/transmit threads 531, as described above by reference to FIG. 5. In some embodiments, the network stack of each non-VM addressable node includes a hypervisor kernel network interface and receive/transmit threads. In some embodiments, the hypervisor kernel network interface (e.g., the vmknic of VMware Inc.) of each non-VM addressable node includes a TCP/IP stack for processing TCP/IP packets received for the non-VM addressable node and sent by the non-VM addressable node. For instance, in some embodiments, each non-VM addressable node's network stack (1) affixes TCP/IP packet headers to packets that it sends from its corresponding mounted volume 1320/1325 or migration data store 1330, and (2) removes TCP/IP packet headers from packets that it receives for storing in its corresponding mounted volume or migration data store.

In some embodiments, the hypervisor kernel network interface of a non-VM addressable node (e.g., the VM migrator 1309) does not include a TCP/IP stack, but rather includes other packet processing modules, such as an RDMA (remote direct memory access) packet processing module. Also, in some embodiments, the network stack of a non-VM addressable node includes other I/O chain modules for performing other transform operations on the packets sent by, and received for, their corresponding volumes or data stores. Like the receive/transmit threads 531 of FIG. 5, the receive/transmit threads of the network stack of each non-VM addressable node manage the modules in the network stack, interact with the PNIC queue 517 that is associated with the stack's non-VM addressable node, and gather statistics regarding the operations of the modules of the stack.

As mentioned above, the virtualization layer also includes the statistics (stat) gathering engine 540, the stat storage 545, and the dynamic load balancer 555. The stat gathering engine 540, the load balancer 555, and the RX/TX threads (not shown) form in part the queue management system of some embodiments. The statistics that are gathered by the stat gathering engine 540 provide the load balancer with the information that it needs to determine when to assign queues to pools and when to adjust pools.

The stat gathering engine gets stats from different sources in different embodiments. For instance, in some embodiments, this engine pulls stats or receives pushed stats from the CPU scheduler 525 (for CPU utilization) and the RX/TX threads (for network traffic). For the network traffic, the virtualization layer has stats (such as throughput, packet rate, packet drops, etc.) gathered from a variety of sources, including each layer of the network stacks (i.e., each module managed by the RX/TX threads).

In some embodiments, the stats gathering engine gathers the following network stats for the load balancer: PNIC packet rate, PNIC throughput, and the CPU utilization of each of the RX/TX threads. In some embodiments, the CPU scheduler updates the CPU utilization data, while the RX/TX threads update the PNIC packet rate and throughput, since they are the threads that actually communicate with the PNIC and have the exact counts. In some embodiments, a PNIC driver module is below the queue selector, and this PNIC driver is the module that communicates with the PNIC and updates the PNIC load statistics. Also, in some embodiments, the stats gathering engine not only gathers the PNIC statistics for the load balancer, but also gathers VNIC stats or stats that are gathered by the non-VM addressable nodes.
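
One way to structure these gathered statistics is sketched below; the field names, types, and units are assumptions for illustration.

from dataclasses import dataclass, field

@dataclass
class PnicQueueStats:
    packet_rate_pps: float = 0.0   # updated by the RX/TX threads
    throughput_bps: float = 0.0    # updated by the RX/TX threads
    cpu_utilization: dict = field(default_factory=dict)  # thread id -> fraction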

By relying on the gathered stats, the load balancer can decide to move a latency-sensitive (VM or non-VM) addressable node to an exclusive queue or a higher priority pool when its packet rate is above some threshold, or when it is being hurt by the throughput of one or more other nodes that share the same queue with it. More generally, the load balancer 555 uses the gathered stats to determine which queues to assign to which addressable nodes, when to dynamically assign queues to pools, and when to dynamically adjust pools.

In some embodiments, the load balancer periodically (e.g., every few seconds, milliseconds, or microseconds) runs a load balancing process. This process pulls stats from the “load stats” data storage 545 that the stat gathering engine maintains and, based on these stats, determines whether it needs to assign addressable nodes to queues, to resize pools, and/or to preempt queues. The load balancer assigns nodes to queues by configuring the filters of the PNIC and the virtualization layer to associate a particular queue identifier with a particular source MAC address for outgoing traffic and a particular destination MAC address for incoming traffic. To configure the MAC filters of the PNIC, the load balancer uses APIs of the PNIC driver to program filters and hardware features for each queue.
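
A hedged sketch of such a periodic load balancing loop is given below; the two-second interval is one of the orders of magnitude mentioned in the text, while stats_store.snapshot() and rebalance() stand in for interfaces the text does not specify.

import threading

def run_load_balancer(stats_store, rebalance, interval_s=2.0):
    def tick():
        stats = stats_store.snapshot()  # pull stats gathered by the stat engine
        rebalance(stats)                # assign nodes, resize pools, preempt queues
        threading.Timer(interval_s, tick).start()
    tick()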

As shown in FIG. 13, the load balancer has three modules, which are the pools 561, the queue balancer 559, and the pool balancer 557. A pool is a software abstraction that the load balancer defines: a grouping of PNIC queues. The load balancer applies different processes to manage the queues in different “pools.” As such, each pool can be viewed as a set of queues that have the same “feature,” where a feature is analogous to a hardware feature (like RSS or LRO). Examples of such features include VM requirements, such as low latency or a low interrupt rate.

By applying different processes to manage the queues in different pools, the load balancer can optimize the allocation of queues and the resizing of the pools differently for different pools. The pool rebalancer 557 resizes each pool based on the pool's resource allocation criteria and preempts queues from other pools when necessary. Examples of such resource allocation criteria include the max/min number of queues of the pool, the total CPU utilization of the pool, the network traffic of the pool, the quality of service (QoS) constraints of the pool, etc. The queue rebalancer 559 rebalances the queues in the same pool based on the pool's rebalancing criteria. Examples of such pool rebalancing criteria include packing addressable nodes on as few queues as possible (e.g., for an HLT pool), distributing addressable nodes across as many queues as possible (e.g., for an LLR pool), etc. In some embodiments, the load balancer 555 manages the PNIC queues that process VM and non-VM addressable node traffic by using processes like those described above by reference to FIGS. 6-9. In some of these embodiments, these processes are simply modified to monitor and manage not only VM traffic but also traffic to and from non-VM addressable nodes (e.g., traffic to and from the mounters 1305 and 1307 and the migrator 1309).

B. Alternative Filtering to Differentiate Different Types of Packets

Instead of MAC address filtering, some embodiments use other filtering techniques to treat different types of packets differently, e.g., to define different pools for different sets of packet types, to manage each pool differently, to dynamically modify the queues within the pools, and to dynamically re-assign different types of data traffic. For instance, based on non-MAC packet identifiers, the method of some embodiments identifies and monitors a first type of data traffic through the NIC of a host device. Based on the monitoring, the method specifies a pool for the first type of data traffic, and assigns a set of the queues to the pool. The method then uses non-MAC address filtering to direct the first type of data traffic to the assigned set of queues.

Alternatively, or conjunctively, based on the monitoring, the method can modify the set of queues assigned to a pool for the first type of data traffic that is identified through the non-MAC packet identifiers. Examples of such modifications include adding or removing a queue from the pool when one or more of the queues of the pool are overutilized or underutilized. In some embodiments, the method adds a queue to the pool by preempting a queue from another pool, e.g., by using one of the above-described preemption methodologies.

Also, alternatively or conjunctively to the above-described operations, the method can re-assign the first type of data traffic from a first queue in the pool to a second queue in the pool, based on the monitoring. For instance, based on the monitoring, the method of some embodiments detects that the first type of data traffic through the first queue falls below a minimum threshold amount of traffic (e.g., for a duration of time). Because of this underutilization, the method switches this traffic to the second queue. Before making this switch, the method of some embodiments determines that the traffic through the second queue does not exceed a maximum threshold amount of traffic.

Alternatively, based on the monitoring, the method of some embodiments might detect that the first type of data traffic through the first queue exceeds a maximum threshold amount of traffic (e.g., for a duration of time). Because of this overutilization, the method switches the first type of data traffic from the first queue to the second queue. Again, before making this switch, the method of some embodiments determines that the traffic through the second queue does not exceed a maximum threshold amount of traffic.

Different embodiments use different non-MAC filtering. Some embodiments use the packet header data to classify the packet payload as one of several types. For instance, some embodiments use the five-tuple IP data in the L3 and L4 packet headers to classify the packet payload. The five-tuple data include the source port identifier, the destination port identifier, the source IP address, the destination IP address, and the protocol. Using these five identifiers, the filters of some embodiments can designate the IP packets to be any number of different types, such as VOIP packets, video packets, audio packets, FTP packets, HTTP packets, HTTPS packets, Remote Desktop packets (PCoIP, VNC, RDP), management packets (authentication, server health monitoring, time synchronization), e-mail packets (POP3, SMTP), etc. Since all of these protocols have different traffic patterns, some embodiments separate one or more of them into different pools of queues, and use different optimization criteria to allocate the data traffic to the queues in each pool.

The list provided below illustrates how the five tuples can be used to differentiate web traffic, VoIP, video streaming, remote desktop, management, and e-mail traffic, by using the following notation: Protocol-src_ip-dst_ip-src_port-dest_port, with * denoting a wildcard match. In this list, it is assumed that a VM is the client that requests the service/data from the server. (A sketch of a matching function follows the list.)

-   Web: TCP-*-*-*-80/443 (80 for HTTP and 443 for HTTPS)
-   VoIP (Skype): TCP/UDP-*-*-23399-* or TCP/UDP-*-*-*-23399 (incoming and outgoing traffic)
-   Video Streaming (MMS): TCP/UDP-*-*-*-1755
-   Remote Desktop (PCoIP): TCP/UDP-*-*-*-4172
-   Authentication (Kerberos): TCP/UDP-*-*-*-88
-   E-Mail (POP3): TCP-*-*-*-110
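
The sketch below applies this notation to a simplified subset of the list (the TCP/UDP variants are collapsed to one protocol each for brevity); the tuple layout and class names are assumptions for illustration.

FILTERS = {
    ("TCP", "*", "*", "*", "80"):    "web",
    ("TCP", "*", "*", "*", "443"):   "web",
    ("UDP", "*", "*", "23399", "*"): "voip",
    ("UDP", "*", "*", "*", "23399"): "voip",
    ("UDP", "*", "*", "*", "1755"):  "video",
    ("UDP", "*", "*", "*", "4172"):  "remote_desktop",
    ("TCP", "*", "*", "*", "88"):    "management",
    ("TCP", "*", "*", "*", "110"):   "email",
}

def classify(five_tuple):
    """Match a concrete (protocol, src_ip, dst_ip, src_port, dst_port)
    against the wildcard patterns above."""
    for pattern, traffic_class in FILTERS.items():
        if all(p == "*" or p == v for p, v in zip(pattern, five_tuple)):
            return traffic_class
    return "default"

# For example, classify(("TCP", "10.0.0.5", "198.51.100.7", "51514", "443"))
# returns "web", which the load balancer could map to a particular pool.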

FIGS. 14 and 15 illustrate examples that show how some embodiments use five-tuple filters to differentiate the VOIP and video packets that are transmitted or received by a virtual machine during a video presentation. FIG. 14 illustrates the case where the five tuples are used to differentiate the VOIP and video packets that are being transmitted by a VM 1405. In this example, the dynamic load balancer 555 sets one set of five-tuple filters 1410 in the queue selector 518 to route VOIP packets from the VM 1405 to the high priority queue pool 1420, while setting another set of five-tuple filters to route video packets from this VM to the low priority queue pool 1425.

FIG. 15 illustrates the case where the five tuples are used to differentiate the VOIP and video packets that are being received by the VM 1405. In this example, the dynamic load balancer 555 sets one set of five-tuple filters 1510 in the RX processing engine 511 of the PNIC 515 to route incoming VOIP packets (that are for the VM 1405) to the high priority queue pool 1420, while setting another set of five-tuple filters to route incoming video packets (that are for the VM 1405) to the low priority queue pool 1425.

The load balancer 555 sets the five-tuple filters in order to group queues into pools, which it then manages based on different criteria. Specifically, by relying on the statistics gathered in the stats storage 545, the load balancer 555 can determine which addressable nodes to assign to which queues, when to dynamically assign queues to pools, when to dynamically remove queues from pools, and when to dynamically re-assign addressable nodes to new queues.

In some embodiments, the load balancer periodically (e.g., every few seconds, milliseconds, or microseconds) runs a load balancing process. This process pulls stats from the “load stats” data storage 545 that the stat gathering engine maintains and, based on these stats, determines whether it needs to assign addressable nodes to queues, to resize pools, and/or to preempt queues. The load balancer assigns nodes to queues by configuring the five-tuple filters of the PNIC and the virtualization layer to associate a particular queue identifier with a particular five-tuple filter. To configure the filters of the PNIC, the load balancer uses APIs of the PNIC driver to program filters and hardware features for each queue.

As described above by reference to FIGS. 5 and 13, the load balancer 555 in the virtualization layers of FIGS. 14 and 15 has three modules (not shown) in some embodiments. These three modules are (1) the storage that stores the pools, the identifiers of their associated queues, and the addressable node associated with each queue, (2) the queue balancer, and (3) the pool balancer.

The load balancer 555 of FIGS. 14 and 15 applies different processes to manage the queues in different “pools.” As such, each pool can be viewed as a set of queues that have the same “feature.” By applying different processes to manage the queues in different pools, the load balancer can optimize the allocation of queues and the resizing of the pools differently for different pools. The pool rebalancer resizes each pool based on the pool's resource allocation criteria and preempts queues from other pools when necessary. Examples of such resource allocation criteria (e.g., the max/min number of queues of the pool, the total CPU utilization of the pool, the network traffic of the pool, the quality of service (QoS) constraints of the pool, etc.) were provided above.

The queue rebalancer rebalances the queues in the same pool based on the pool's rebalancing criteria, such as packing addressable nodes on as few queues as possible (e.g., for an HLT pool), or distributing addressable nodes across as many queues as possible (e.g., for an LLR pool). In some embodiments, the load balancer 555 of FIGS. 14 and 15 manages the PNIC queues that process VM and non-VM addressable node traffic by using processes like those described above by reference to FIGS. 6-9. In some of these embodiments, these processes are simply modified to monitor and manage not only VM traffic but also traffic to and from non-VM addressable nodes.

V. Electronic System

Many of the above-described features and applications are implemented as software processes that are specified as a set of instructions recorded on a computer readable storage medium (also referred to as a computer readable medium). When these instructions are executed by one or more processing unit(s) (e.g., one or more processors, cores of processors, or other processing units), they cause the processing unit(s) to perform the actions indicated in the instructions. Examples of computer readable media include, but are not limited to, CD-ROMs, flash drives, RAM chips, hard drives, EPROMs, etc. The computer readable media do not include carrier waves and electronic signals passing wirelessly or over wired connections.

In this specification, the term “software” is meant to include firmware residing in read-only memory or applications stored in magnetic storage, which can be read into memory for processing by a processor. Also, in some embodiments, multiple software inventions can be implemented as sub-parts of a larger program while remaining distinct software inventions. In some embodiments, multiple software inventions can also be implemented as separate programs. Finally, any combination of separate programs that together implement the processes described herein is within the scope of the invention. In some embodiments, the programs, when installed to operate on one or more electronic systems, define one or more specific machine implementations that execute and perform the operations of the software programs.

FIG. 16 conceptually illustrates an electronic system 1600 with which some embodiments of the invention are implemented. The electronic system 1600 can be any of the host devices described above. This system can be any of the devices executing any of the processes and/or queue management systems described above. The electronic system 1600 may be a computer (e.g., a desktop computer, personal computer, tablet computer, server computer, mainframe, blade computer, etc.), phone, PDA, or any other sort of electronic device. Such an electronic system includes various types of computer readable media and interfaces for various other types of computer readable media. Electronic system 1600 includes a bus 1605, processing unit(s) 1610, a system memory 1625, a read-only memory 1630, a permanent storage device 1635, input devices 1640, and output devices 1645.

The bus 1605 collectively represents all system, peripheral, and chipset buses that communicatively connect the numerous internal devices of the electronic system 1600. For instance, the bus 1605 communicatively connects the processing unit(s) 1610 with the read-only memory 1630, the system memory 1625, and the permanent storage device 1635.

From these various memory units, the processing unit(s) 1610 retrieve instructions to execute and data to process in order to execute the processes of the invention. The processing unit(s) may be a single processor or a multi-core processor in different embodiments.

The read-only-memory (ROM) 1630 stores static data and instructions that are needed by the processing unit(s) 1610 and other modules of the electronic system. The permanent storage device 1635, on the other hand, is a read-and-write memory device. This device is a non-volatile memory unit that stores instructions and data even when the electronic system 1600 is off. Some embodiments of the invention use a mass-storage device (such as a magnetic or optical disk and its corresponding disk drive) as the permanent storage device 1635.

Other embodiments use a removable storage device (such as a floppy disk, flash drive, etc.) as the permanent storage device. Like the permanent storage device 1635, the system memory 1625 is a read-and-write memory device. However, unlike the storage device 1635, the system memory is a volatile read-and-write memory, such as a random access memory. The system memory stores some of the instructions and data that the processor needs at runtime. In some embodiments, the invention's processes are stored in the system memory 1625, the permanent storage device 1635, and/or the read-only memory 1630. From these various memory units, the processing unit(s) 1610 retrieve instructions to execute and data to process in order to execute the processes of some embodiments.

The bus 1605 also connects to the input and output devices 1640 and 1645. The input devices enable the user to communicate information and select commands to the electronic system. The input devices 1640 include alphanumeric keyboards and pointing devices (also called “cursor control devices”). The output devices 1645 display images generated by the electronic system. The output devices include printers and display devices, such as cathode ray tubes (CRT) or liquid crystal displays (LCD). Some embodiments include devices such as a touchscreen that function as both input and output devices.

Finally, as shown in FIG. 16, the bus 1605 also couples the electronic system 1600 to a network 1665 through a network adapter (not shown). In this manner, the computer can be a part of a network of computers (such as a local area network (“LAN”), a wide area network (“WAN”), or an Intranet), or a network of networks, such as the Internet. Any or all components of the electronic system 1600 may be used in conjunction with the invention.

Some embodiments include electronic components, such as microprocessors, storage, and memory, that store computer program instructions in a machine-readable or computer-readable medium (alternatively referred to as computer-readable storage media, machine-readable media, or machine-readable storage media). Some examples of such computer-readable media include RAM, ROM, read-only compact discs (CD-ROM), recordable compact discs (CD-R), rewritable compact discs (CD-RW), read-only digital versatile discs (e.g., DVD-ROM, dual-layer DVD-ROM), a variety of recordable/rewritable DVDs (e.g., DVD-RAM, DVD-RW, DVD+RW, etc.), flash memory (e.g., SD cards, mini-SD cards, micro-SD cards, etc.), magnetic and/or solid state hard drives, read-only and recordable Blu-Ray® discs, ultra density optical discs, any other optical or magnetic media, and floppy disks. The computer-readable media may store a computer program that is executable by at least one processing unit and includes sets of instructions for performing various operations. Examples of computer programs or computer code include machine code, such as is produced by a compiler, and files including higher-level code that are executed by a computer, an electronic component, or a microprocessor using an interpreter.

While the above discussion primarily refers to microprocessors or multi-core processors that execute software, some embodiments are performed by one or more integrated circuits, such as application specific integrated circuits (ASICs) or field programmable gate arrays (FPGAs). In some embodiments, such integrated circuits execute instructions that are stored on the circuit itself.

As used in this specification, the terms “computer”, “server”, “processor”, and “memory” all refer to electronic or other technological devices. These terms exclude people or groups of people. For the purposes of the specification, the terms “display” or “displaying” mean displaying on an electronic device. As used in this specification, the terms “computer readable medium,” “computer readable media,” and “machine readable medium” are entirely restricted to tangible, physical objects that store information in a form that is readable by a computer. These terms exclude any wireless signals, wired download signals, and any other ephemeral or transitory signals.

While the invention has been described with reference to numerous specific details, one of ordinary skill in the art will recognize that the invention can be embodied in other specific forms without departing from the spirit of the invention. In addition, a number of the figures (including FIGS. 6-9) conceptually illustrate processes. The specific operations of these processes may not be performed in the exact order shown and described. The specific operations may not be performed in one continuous series of operations, and different specific operations may be performed in different embodiments. Furthermore, the process could be implemented using several sub-processes, or as part of a larger macro process.

We claim:
1. For an electronic device that comprises a network interface card (NIC) with a plurality of queues for temporarily storing data traffic through the NIC, a method of managing the queues, the method comprising: assigning a subset of data traffic to a set of queues; monitoring the subset of data traffic through the set of queues; and based on the monitoring, modifying the set of queues.
2. The method of claim 1, wherein modifying the set of queues comprises assigning a new queue to the set of queues when data traffic through at least a subset of the set of queues exceeds a maximum threshold amount.
3. The method of claim 2, wherein the subset of queues includes all the queues in the set of queues.
4. The method of claim 2, wherein the subset of queues does not include all the queues in the set of queues.
5. The method of claim 1, wherein modifying the set of queues comprises removing a particular queue from the set of queues when the data traffic through the particular queue is below the minimum threshold amount.
6. The method of claim 1, wherein modifying the set of queues comprises removing a particular queue from the set of queues when the data traffic through the particular queue is below the minimum threshold amount for a duration of time.
7. The method of claim 1, wherein monitoring data traffic comprises monitoring data traffic associated with addressable nodes executing on the electronic device.
8. The method of claim 1, wherein assigning the subset of data traffic comprises directing the subset of data traffic to the set of queues.
9. The method of claim 8, wherein directing the subset of data traffic comprises specifying a set of filters in the NIC to route the subset of data traffic through the set of queues.
10. The method of claim 9, wherein the set of filters route the subset of data traffic through the set of queues to a set of addressable destination nodes executing on the electronic device.
11. The method of claim 8, wherein directing the subset of data traffic comprises specifying a set of filters that route the subset of data traffic from a set of addressable source nodes executing on the electronic device out of the electronic device through the set of queues.
12. The method of claim 11, wherein the set of filters is defined in a network layer that shares a set of networking resources on the electronic device with multiple addressable source nodes.
13. The method of claim 11, wherein said assigning, directing, monitoring and modifying are operations performed by a network virtualization layer that shares a set of networking resources on the electronic device amongst multiple different virtual modules, wherein the set of filters is defined in the network virtualization layer to assign data traffic from different virtual modules to different queues in the plurality of queues.
14. For an electronic device that comprises a network interface card (NIC) with a plurality of queues for temporarily storing data traffic through the NIC, a non-transitory machine readable medium storing a program for managing the queues, the program comprising sets of instructions for: assigning a subset of data traffic to a set of queues; monitoring the subset of data traffic through the set of queues; and based on the monitoring, modifying the set of queues.
15. The machine readable medium of claim 14, wherein the set of instructions for modifying the set of queues comprises a set of instructions for assigning a new queue to the set of queues when data traffic through at least a subset of the set of queues exceeds a maximum threshold amount.
16. The machine readable medium of claim 15, wherein the subset of queues includes all the queues in the set of queues.
17. The machine readable medium of claim 15, wherein the subset of queues does not include all the queues in the set of queues.
18. The machine readable medium of claim 14, wherein the set of instructions for modifying the set of queues comprises a set of instructions for assigning a new queue to the set of queues when data traffic through at least a subset of the set of queues exceeds a maximum threshold amount for a duration of time.
19. The machine readable medium of claim 14, wherein the set of instructions for modifying the set of queues comprises a set of instructions for removing a particular queue from the set of queues when the data traffic through the particular queue is below the minimum threshold amount.
20. The machine readable medium of claim 14, wherein the set of instructions for assigning the subset of data traffic comprises a set of instructions for specifying a set of filters in the NIC to route the subset of data traffic through the set of queues.
21. The machine readable medium of claim 20, wherein the set of filters route the subset of data traffic through the set of queues to a set of addressable destination nodes executing on the electronic device.
22. The machine readable medium of claim 14, wherein the set of instructions for assigning the subset of data traffic comprises a set of instructions for specifying a set of filters that route the subset of data traffic from a set of addressable source nodes executing on the electronic device out of the electronic device through the set of queues.